Abstract

Tensor rank and low-rank tensor decompositions have many applications in learning and
complexity theory. Most known algorithms use unfoldings of tensors and can only handle rank
up to $n^{\lfloor p/2 \rfloor}$ for a $p$-th order tensor in $\mathbb{R}^{n^p}$. Previously no efficient algorithm could decompose 3rd
order tensors when the rank is super-linear in the dimension. Using ideas from the sum-of-squares
hierarchy, we give the first quasi-polynomial time algorithm that can decompose a random 3rd
order tensor when the rank is as large as $n^{3/2}/\operatorname{poly}\log n$.

We also give a polynomial time algorithm for certifying the injective norm of random low 
rank tensors. Our tensor decomposition algorithm exploits the relationship between injective 
norm and the tensor components. The proof relies on interesting tools for decoupling random 
variables to prove better matrix concentration bounds, which can be useful in other settings. 


1 Introduction 

Tensors, as a natural generalization of matrices, are often used to represent multi-linear relationships
or data that involves higher order correlation. A $p$-th order tensor $T \in \mathbb{R}^{n^p}$ is a $p$-dimensional
array indexed by $[n]^p$. A tensor $T$ is rank-1 if it can be written as the outer product of $p$ vectors
$T = a_1 \otimes \cdots \otimes a_p$, where $a_i \in \mathbb{R}^n$ (for $i = 1, \ldots, p$). Equivalently, $T_{i_1,\ldots,i_p} = \prod_{j=1}^{p} a_j(i_j)$, where
$a_j(i_j)$ denotes the $i_j$-th entry of vector $a_j$.
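The entrywise definition of a rank-1 tensor can be illustrated with a small sketch. The helper name `rank_one_tensor` and the example vectors are assumptions made for illustration only; they do not come from the paper.

```python
from itertools import product

def rank_one_tensor(vectors):
    """Outer product of the given vectors, stored as a dict keyed by index tuples."""
    index_ranges = [range(len(v)) for v in vectors]
    tensor = {}
    for idx in product(*index_ranges):
        entry = 1.0
        for vec, i in zip(vectors, idx):
            entry *= vec[i]          # T[i1,...,ip] = prod_j a_j(i_j)
        tensor[idx] = entry
    return tensor

# Three illustrative vectors in R^2 (so p = 3 and n = 2).
a1, a2, a3 = [1.0, 2.0], [3.0, -1.0], [0.5, 4.0]
T = rank_one_tensor([a1, a2, a3])
print(T[(1, 0, 1)])  # a1[1] * a2[0] * a3[1] = 2 * 3 * 4 = 24.0
```

A dictionary keyed by index tuples keeps the sketch dependency-free; a dense array type would be the natural choice for larger sizes.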

Low rank tensors — similar to low rank matrices — are widely used in many applications. The 
rank of tensor T is defined as the minimum number m such that T can be written as the sum of m 
rank-1 tensors. This agrees with the definition of matrix rank. However, most of the corresponding 
tensor problems are much harder: for $p \ge 3$ computing the rank of the tensor (as well as many
related problems) is NP-hard [Hås90, HL09]. Tensor rank is also not as well-behaved as matrix
rank (see for example the survey [Com14]).

Unlike matrices, low rank tensor decompositions are often unique [Kru77], which is important
in many applications. In special cases (especially when rank $m$ is less than dimension $n$) tensor
decomposition can be efficiently computed. Such specialized tensor decompositions have been the
key algorithmic ideas in many recent algorithms for learning latent variable models, including
mixture of Gaussians, Independent Component Analysis, Hidden Markov Model and Latent Dirichlet
Allocation (see [AGH+14]). In many cases tensor decomposition can be viewed as reinterpreting
previous spectral learning results [Cha96, MR06, AFH+12, AHK12]. This new interpretation has
also inspired many new works (e.g. [AGHK13, BCMV14, GHK15]).

A common limitation in early tensor decomposition algorithms is that they only work for the
undercomplete case when rank $m$ is at most the dimension $n$. Although there are some attempts
to decompose tensors in the overcomplete case ($m > n$) [DLCC07, BCMV14, ABG+13], all these
works require 4-th or higher order tensors. In many machine learning applications, the number
of samples required to accurately estimate a 4-th order tensor is too large. In practice, algorithms
based on 3rd order tensors are much preferable. Therefore we are interested in the key question:
are there any efficient algorithms for overcomplete 3rd order tensor decomposition?

In the worst case setting, overcomplete 3rd order tensors are not well-understood. Kruskal [Kru77]
showed the tensor decomposition is unique when the rank $m \le 1.5n - 1$ and the components are
in general position, but there is no efficient algorithm known for finding this decomposition.
Constructing an explicit 3rd order tensor with rank $\Omega(n^{1+\epsilon})$ will give nontrivial circuit complexity
lower bounds [Str73], while the best known rank bound for an explicit 3rd order tensor is only
$3n - O(\log n)$ [AFT11].

For many of the learning applications, it is natural to consider the average case problem where
the components of the tensor are chosen according to a random distribution. In this case [AGJ14]
give a polynomial time algorithm that can find the true components when $m = Cn$ for any constant
$C > 0$ (however the runtime depends exponentially on $C$).

This paper also considers this average case setting and gives a quasi-polynomial algorithm for
decomposing the tensor when $m$ can be as large as $n^{3/2}/\operatorname{poly}\log n$. The main idea of the algorithm is based on
the sum-of-squares (SoS) SDP hierarchy ([Par00, Las01]; see Section 2 and the recent survey [BS14]).
The main difficulty in handling overcomplete 3rd order tensors is that there is no natural unfolding
(i.e. mapping to a matrix) that can certify the rank of the tensor. We can unfold a 4-th order
tensor $T$ into a matrix $M$ of size $n^2 \times n^2$ where $M_{(i_1,i_2),(i_3,i_4)} = T_{i_1,i_2,i_3,i_4}$. However, unfolding a 3rd
order tensor will result in a very unbalanced matrix of dimension $n \times n^2$ that cannot have rank
more than $n$. Intuitively, the power of SoS-based algorithms is that they can provide higher-order
"pseudo-moments" that will allow us to use nontrivial unfoldings.
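The two unfoldings described above can be sketched as follows. The function names and the tiny component set are illustrative assumptions; the point is only the shapes of the resulting matrices ($n^2 \times n^2$ versus $n \times n^2$) and how entries are indexed.

```python
from itertools import product

def symmetric_power_tensor(components, order):
    """Sum over components a of the order-fold outer product of a with itself."""
    n = len(components[0])
    tensor = {}
    for idx in product(range(n), repeat=order):
        total = 0.0
        for a in components:
            term = 1.0
            for i in idx:
                term *= a[i]
            total += term
        tensor[idx] = total
    return tensor

def unfold_order4(T4, n):
    """Rows indexed by (i1, i2), columns by (i3, i4): an n^2 x n^2 matrix."""
    return [[T4[(i1, i2, i3, i4)] for i3 in range(n) for i4 in range(n)]
            for i1 in range(n) for i2 in range(n)]

def unfold_order3(T3, n):
    """Rows indexed by i1, columns by (i2, i3): an n x n^2 matrix."""
    return [[T3[(i1, i2, i3)] for i2 in range(n) for i3 in range(n)]
            for i1 in range(n)]

n = 3
components = [[1.0, 0.0, 2.0], [0.5, -1.0, 1.0]]   # illustrative values
M4 = unfold_order4(symmetric_power_tensor(components, 4), n)
M3 = unfold_order3(symmetric_power_tensor(components, 3), n)
print(len(M4), len(M4[0]))  # 9 9
print(len(M3), len(M3[0]))  # 3 9
```

Since the order-3 unfolding has only $n$ rows, its rank is at most $n$ no matter how many components the tensor has, which is the obstacle the paragraph above describes.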

In particular, the key component of the proof is a way of certifying injective norm (see Section 2)
of random tensors, which is closely related to the problem of certifying the 2-to-4 norm of random
matrices [BBH+12]. Recently, there has been an increasing number of applications of SoS hierarchy
to learning problems. [BKS14] give algorithms for finding the sparsest vectors in a subspace, which is
closely related to many learning problems. [BKS15] give a new algorithm for dictionary learning that
can handle nearly linear sparsity, and also an algorithm for robust tensor decomposition. However,
their result requires a tensor of high order. [BM15] studies a related problem of tensor prediction,
also using ideas of SoS hierarchies.

1.1 Our Results 

In this paper we give a quasi-polynomial time algorithm for decomposing third-order tensors when
the rank $m$ is almost as large as $n^{3/2}$ and the components of the tensor are chosen randomly. More
concretely, we define $\mathcal{D}_{m,n}$ to be a distribution of third order tensors of the following form:

$$T = \sum_{i=1}^{m} a_i^{\otimes 3},$$

where the vectors $a_i \in \mathbb{R}^n$ are uniformly random vectors in $\{\pm 1/\sqrt{n}\}^n$ and $a_i^{\otimes 3}$ is short for $a_i \otimes a_i \otimes a_i$.

Our goal is to recover these components $a_i$'s. Since any permutation of the $a_i$'s is still a valid solution,
we say two decompositions are $\epsilon$-close if they are close after an arbitrary permutation:

Definition 1 ($\epsilon$-close). Two sets of vectors $\{a_i\}_{i=1}^m$ and $\{b_i\}_{i=1}^m$ in $\mathbb{R}^n$ are $\epsilon$-close if there exists
a permutation $\pi : [m] \to [m]$ such that $\|a_{\pi(i)} - b_i\| \le \epsilon$ for all $i$. Two decompositions of the tensor $T$ are
$\epsilon$-close if their components are $\epsilon$-close.
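Definition 1 can be checked directly on small instances. The sketch below is a brute-force check over all permutations, so it is only meant for tiny $m$; the function names and example vectors are illustrative assumptions.

```python
from itertools import permutations
from math import sqrt

def distance(u, v):
    """Euclidean distance between two vectors given as lists."""
    return sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

def is_eps_close(A, B, eps):
    """True if some permutation pi gives ||A[pi(i)] - B[i]|| <= eps for every i."""
    if len(A) != len(B):
        return False
    m = len(A)
    return any(all(distance(A[pi[i]], B[i]) <= eps for i in range(m))
               for pi in permutations(range(m)))

A = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.05, 0.99], [0.98, 0.0]]      # A with its two vectors swapped and perturbed
print(is_eps_close(A, B, 0.1))       # True
print(is_eps_close(A, B, 0.01))      # False
```

For larger $m$ one would replace the enumeration by a bipartite matching on the pairs within distance $\epsilon$; that variant is not shown here.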

For tensors in distribution $\mathcal{D}_{m,n}$ our algorithm can recover the decomposition as long as $m \ll n^{3/2}$.

Theorem 1.1. Given a tensor $T = \sum_{i=1}^m a_i^{\otimes 3}$ sampled from distribution $\mathcal{D}_{m,n}$, when $m \ll n^{3/2}$
there is an algorithm that runs in time $n^{O(\log n)}$ and with high probability returns a decomposition
$T \approx \sum_{i=1}^m \hat{a}_i^{\otimes 3}$ that is 0.1-close to the true decomposition.

Our result easily generalizes to many other distributions for $a_i$ (including a uniform random
vector in the unit sphere or a spherical Gaussian).

The algorithm does not output a very accurate solution (the accuracy can be improved to $\epsilon$ with
an exponential dependency on $1/\epsilon$). However it is known that alternating minimization algorithms
can refine the decomposition once we have a nice initial point [AGJ14]:

Theorem 1.2 ([AGJ14]). Given a tensor $T$ from distribution $\mathcal{D}_{m,n}$ ($m \ll n^{3/2}$), and an initial
solution that is 0.1-close to the true decomposition, then for any $\epsilon > 0$ (that may depend on $n$)
there is an algorithm that runs in time $\operatorname{poly}(n, \log 1/\epsilon)$ that with high probability finds a refined
decomposition that is $\epsilon$-close to the true decomposition.

Combining the two results we have an algorithm that runs in time $n^{O(\log n)} \operatorname{polylog}(1/\epsilon)$ that
recovers a decomposition that is component-wise $\epsilon$-close to the true decomposition.

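Corollary 1.1. Given a tensor $T = \sum_{i=1}^m a_i^{\otimes 3}$ sampled from distribution $\mathcal{D}_{m,n}$, when $m \ll n^{3/2}$, for
any $\epsilon > 0$ there is an algorithm that runs in time $n^{O(\log n)} \operatorname{polylog}(1/\epsilon)$ and with high probability
returns a decomposition $T \approx \sum_{i=1}^m \hat{a}_i^{\otimes 3}$ that is $\epsilon$-close to the true decomposition.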

The main idea in proving Theorem 1.1 is the observation that when the tensor is generated
randomly from $\mathcal{D}_{m,n}$, the true components are close to the maximizers of the multilinear form
$T(x, x, x) = \sum_{i,j,k \in [n]} T_{i,j,k} x_i x_j x_k = \sum_{i=1}^m \langle a_i, x \rangle^3$. The maximum value of $T(x, x, x)$ on unit vectors
$\|x\| = 1$ is known as the injective norm of the tensor. Computing or even approximating the injective
norm is known to be hard [Gur03, HM13]. A key component of our approach is a sum-of-squares
algorithm (see Section 2 for preliminaries about sum-of-squares algorithms) that certifies that the
injective norm of a random tensor from $\mathcal{D}_{m,n}$ is small.

Theorem 1.3. For a tensor $T$ in distribution $\mathcal{D}_{m,n}$, when $m \ll n^{3/2}$ with high probability the
injective norm of $T$ is bounded by $1 + o(1)$. Further, this can be certified in polynomial time.

Our results (Theorems 1.1 and 1.3) still hold when we are given a tensor $\tilde{T}$ that is $1/\operatorname{poly}(n)$-close
to $T$ in the sense that the spectral norm of an unfolding of $\tilde{T} - T$ is $O(1/\operatorname{poly}\log(n))$. Theorem 1.2
(and hence Corollary 1.1) requires a tensor $\tilde{T}$ such that the unfolding of $\tilde{T} - T$ has spectral norm
bounded by $\epsilon/\operatorname{poly}(n)$.


Organization The rest of this paper is organized as follows: In Section 2 we introduce tensor
notations and SoS hierarchies. Then we describe the main idea of the proof, which relates tensor
decomposition to the injective norm of a tensor (Section 3). In Section 4 we give a polynomial time
algorithm for certifying the injective norm of a random 3rd order tensor. Using this as a key tool, in
Section 5 we present the quasi-polynomial time algorithm that can decompose randomly generated
tensors when $m \ll n^{3/2}$.

2 Preliminaries 

Notations In this paper we use $\|\cdot\|$ to denote the $\ell_2$ norm of vectors and the spectral norm
of matrices. That is, $\|u\| = \sqrt{\sum_i u_i^2}$ and $\|A\| = \sup_{\|u\|=1} \|Au\|$. Note that we will be using the
sum-norm instead of the expectation norm $\|u\|_{\exp} = \sqrt{\mathbb{E}_i[u_i^2]}$ because the scaling of the sum-norm is more
natural for the tensor decomposition setting. We use $\langle u, v \rangle$ to denote the inner product of $u$ and $v$.
When $A$ and $B$ are two matrices, we use the standard notation $A \preceq B$ to denote the fact that $B - A$
is positive semidefinite. For an $m \times n$ matrix $U$ and a $p \times q$ matrix $V$, we define the Kronecker
product $U \otimes V$ as the $mp \times nq$ block matrix

$$U \otimes V = \begin{pmatrix} U_{1,1}V & \cdots & U_{1,n}V \\ \vdots & \ddots & \vdots \\ U_{m,1}V & \cdots & U_{m,n}V \end{pmatrix}.$$
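The block structure of the Kronecker product can be made concrete with a short sketch on lists of rows. The function name `kronecker` and the two small matrices are illustrative assumptions.

```python
def kronecker(U, V):
    """Kronecker product of an m x n matrix U and a p x q matrix V (lists of rows).

    Block (i, j) of the result is U[i][j] * V, so the output is mp x nq.
    """
    p, q = len(V), len(V[0])
    m, n = len(U), len(U[0])
    return [[U[i][j] * V[k][l] for j in range(n) for l in range(q)]
            for i in range(m) for k in range(p)]

U = [[1, 2], [3, 4]]
V = [[0, 1], [1, 0]]
K = kronecker(U, V)
for row in K:
    print(row)
# [0, 1, 0, 2]
# [1, 0, 2, 0]
# [0, 3, 0, 4]
# [3, 0, 4, 0]
```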


We use $\tilde{O}$ notations to hide dependencies on polylog factors in $n$ and $m$. When we write $f \ll g$
we mean $f \le g/O(\operatorname{poly}\log n)$. Throughout the paper high probability means the probability is at
least 1 — 


Tensors Tensors are multi-dimensional arrays. In this paper for simplicity we only consider 3rd
order symmetric tensors and their symmetric decompositions. For a third order symmetric tensor
$T$, the value of $T_{i,j,k}$ only depends on the multi-set $\{i,j,k\}$, so $T_{i,j,k} = T_{j,i,k}$ (and more
generally all the 6 permutations are equal). For a vector $v \in \mathbb{R}^n$, we use $v^{\otimes 3} \in \mathbb{R}^{n^3}$ to denote the
symmetric third order tensor such that $v^{\otimes 3}_{i,j,k} = v_i v_j v_k$. Our goal is to decompose a tensor $T$ as

$$T = \sum_{i=1}^{m} a_i^{\otimes 3}.$$

There is a bijection between 3rd order symmetric tensors and homogeneous degree 3 polynomials.
In particular, for a tensor $T$ we define its corresponding polynomial $T(x, x, x) = \sum_{i,j,k=1}^{n} T_{i,j,k} x_i x_j x_k$.
It is easy to verify that if $T = \sum_{i=1}^m a_i^{\otimes 3}$ then $T(x,x,x) = \sum_{i=1}^m \langle a_i, x \rangle^3$.
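The identity between the entrywise polynomial and the sum of cubed inner products can be checked numerically. This is a minimal sketch; the helper names and the example components are assumptions chosen for illustration.

```python
from itertools import product

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def symmetric_tensor(components):
    """Entries T[i,j,k] = sum over components a of a[i] * a[j] * a[k]."""
    n = len(components[0])
    return {(i, j, k): sum(a[i] * a[j] * a[k] for a in components)
            for i, j, k in product(range(n), repeat=3)}

def poly_from_tensor(T, x):
    """T(x, x, x) computed from the n^3 entries."""
    return sum(val * x[i] * x[j] * x[k] for (i, j, k), val in T.items())

def poly_from_components(components, x):
    """T(x, x, x) computed as the sum of <a_i, x>^3."""
    return sum(dot(a, x) ** 3 for a in components)

components = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.6, -0.8]]
x = [0.6, 0.0, 0.8]
T = symmetric_tensor(components)
print(abs(poly_from_tensor(T, x) - poly_from_components(components, x)) < 1e-12)  # True
```

The second form needs only $O(mn)$ arithmetic operations instead of $O(n^3)$, which is why the later sketches work with the components directly.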

The injective norm $\|T\|_{\mathrm{inj}}$ is defined to be the maximum value of the corresponding polynomial
on the unit sphere, that is:

$$\|T\|_{\mathrm{inj}} := \sup_{\|x\|=1} T(x,x,x).$$

It is not hard to prove that when $m \ll n^{3/2}$ and the tensor $T$ is chosen from the distribution $\mathcal{D}_{m,n}$,
with high probability $1 - o(1) \le \|T\|_{\mathrm{inj}} \le 1 + o(1)$, and in fact the value $T(x, x, x)$ is only close to
1 if $x$ is close to one of the components $a_i$. We will give a (SoS) proof of this fact in Section 4.

Sum-of-Squares Algorithms and Proofs Here we will only briefly introduce the notations and
key concepts that are used in this paper. For more detailed discussions and references about SoS
proofs we refer readers to [BS14] (especially Section 2).

The sum-of-squares proof system is a proof system for polynomial equalities and inequalities. Given
a set of constraints $\{r_i(x) = 0\}$ and a degree bound $d$, we say there is a degree $d$ SoS proof for
$p(x) \ge q(x)$ if $p(x) - q(x)$ can be written as a sum of squares of polynomials modulo $r_i(x) = 0$, as
defined formally below.

Definition 2 (SoS proof of degree $d$). For a set of constraints $R = \{r_1(x) = 0, \ldots, r_t(x) = 0\}$, and
an integer $d$, we write

$$p(x) \succeq_{R,d} q(x)$$

if there exist polynomials $h_i(x)$ for $i = 0, 1, \ldots, \ell$ and $g_j(x)$ for $j = 1, \ldots, t$ such that
$\deg(h_0^2 (p(x) - q(x))) \le d$, $\deg(h_i) \le d/2$ (for $i > 0$) and $\deg(g_j r_j) \le d$ that satisfy

$$h_0(x)^2 (p(x) - q(x)) = \sum_{i=1}^{\ell} h_i(x)^2 + \sum_{j=1}^{t} r_j(x) g_j(x).$$

We will drop the subscript $d$ when it is clear from the context.

Note that the constraint set can be easily generalized to a set of inequalities by adding auxiliary
variables. For example, the constraint $r(x) \ge 0$ can be implemented as $r(x) = z^2$ where $z$ is an auxiliary
variable.

Many well-known inequalities can be proved using a low degree SoS proof. Among them the
most useful and important one is the Cauchy-Schwarz inequality, which can be proved via degree-2
sum of squares. Another one is that $x^\top A x \preceq \|A\| \|x\|^2$. This is particularly useful when $A$ is a random
matrix, where we can use random matrix theory to bound the spectral norm of $A$.

In order to turn SoS arguments into an algorithm, we often consider the pseudo-expectation.
Just as we have expectations for real distributions, we think of pseudo-expectations as expectations
for pseudo-distributions that cannot be distinguished from true expectations using low degree
polynomials. Pseudo-expectation can be viewed as a dual of SoS refutations.

Definition 3 (pseudo-expectation). A degree $d$ pseudo-expectation $\tilde{\mathbb{E}}$ is a linear operator that maps
degree $d$ polynomials to reals. The operator satisfies $\tilde{\mathbb{E}}[1] = 1$ and $\tilde{\mathbb{E}}[p^2(x)] \ge 0$ for all polynomials
$p(x)$ of degree at most $d/2$. We say a degree-$d$ pseudo-expectation $\tilde{\mathbb{E}}$ satisfies a set of equations
$\{r_i(x) : i = 1, \ldots, t\}$ if for any $i$ and any $q(x)$ such that $\deg(r_i q) \le d$,

$$\tilde{\mathbb{E}}[r_i(x) q(x)] = 0.$$

By definition, if $p(x) \preceq_{R,d} q(x)$, and a degree-$d$ pseudo-expectation $\tilde{\mathbb{E}}$ satisfies $R$, then we can take
pseudo-expectation on both sides and obtain $\tilde{\mathbb{E}}[p(x)] \le \tilde{\mathbb{E}}[q(x)]$. We will use this property of
pseudo-expectation many times in the proofs.

The relationship between pseudo-expectations and SoS refutations can be summarized in the 
following informal lemma: 

Lemma 1 ([Par00, Las01], c.f. [BS14], informally stated). For a set of constraints $R$, either there is
an SoS refutation of degree $d$ that refutes $R$, or there is a degree $d$ pseudo-expectation that satisfies
$R$. Such a refutation/pseudo-expectation can be found in $\operatorname{poly}(t n^d)$ time.

3 Relating Tensor Decompositions and Injective Norm

In this section we introduce the main idea of our proof. Given a tensor $T = \sum_{i=1}^m a_i^{\otimes 3}$ from
distribution $\mathcal{D}_{m,n}$, we first make some observations about its corresponding polynomial $T(x, x, x) = \sum_{i=1}^m \langle a_i, x \rangle^3$.

When $x = a_1$, we know $T(a_1, a_1, a_1) = 1 + \sum_{i \ge 2} \langle a_i, a_1 \rangle^3$. Here, conditioned on $a_1$, the second
term is a sum of independent random variables ($\langle a_i, a_1 \rangle^3$). By the distribution $\mathcal{D}_{m,n}$ we know these
variables have mean 0 and absolute value around $1/n^{3/2}$. Standard concentration bounds show that
when $m \ll n^{3/2}$, with high probability $T(a_1, a_1, a_1) = 1 \pm o(1)$.

On the other hand, suppose $x$ is a random vector in the unit sphere; then $T(x,x,x) = \sum_{i=1}^m \langle a_i, x \rangle^3$
is again a sum of random variables. By concentration bounds we know for any particular
$x$, when $m \ll n^{3/2}$, with high probability $T(x,x,x) = o(1)$. This can actually be generalized
to all vectors $x$ that do not have large correlation with the $a_i$'s using $\epsilon$-net arguments.

Observation. For a random tensor $T \sim \mathcal{D}_{m,n}$, when $m \ll n^{3/2}$, with high probability $T(x,x,x) \le 1 + o(1)$
for $\|x\| = 1$. Further, when $T(x,x,x)$ is close to 1 the vector $x$ is close to one of the
components $a_i$'s.
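The decomposition $T(a_1,a_1,a_1) = 1 + \sum_{i \ge 2} \langle a_i, a_1\rangle^3$ can be explored numerically with a small sketch. It assumes the components are drawn uniformly from $\{\pm 1/\sqrt{n}\}^n$ as in $\mathcal{D}_{m,n}$; the sizes, the seed, and the function names are illustrative choices, and the sketch makes no claim about how small the cross term is at these sizes.

```python
import random
from math import sqrt

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sample_components(m, n, rng):
    """Draw m vectors uniformly from {+1/sqrt(n), -1/sqrt(n)}^n."""
    scale = 1.0 / sqrt(n)
    return [[rng.choice((-scale, scale)) for _ in range(n)] for _ in range(m)]

def tensor_poly(components, x):
    """T(x, x, x) = sum_i <a_i, x>^3."""
    return sum(dot(a, x) ** 3 for a in components)

rng = random.Random(0)
n, m = 64, 128                      # illustrative sizes only
components = sample_components(m, n, rng)

a1 = components[0]
cross_term = sum(dot(a, a1) ** 3 for a in components[1:])
value_at_component = tensor_poly(components, a1)   # equals 1 + cross_term

g = [rng.gauss(0.0, 1.0) for _ in range(n)]        # random direction on the sphere
g_norm = sqrt(dot(g, g))
x_random = [gi / g_norm for gi in g]
value_at_random = tensor_poly(components, x_random)

print(value_at_component, cross_term, value_at_random)
```

Comparing `value_at_component` with `value_at_random` for growing $n$ gives an empirical feel for the observation above.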

Later we will give a SoS proof for this observation. Based on this observation, if we want to
find a component, then it suffices to find a vector $x$ such that $T(x,x,x)$ is close to 1. Using the
idea of pseudo-expectations, we can do this in two steps:

1. Find a pseudo-expectation $\tilde{\mathbb{E}}$ that satisfies the constraint $\|x\|^2 - 1 = 0$ and maximizes
$\tilde{\mathbb{E}}[T(x,x,x)]$.

2. "Sample" from this pseudo-distribution with pseudo-expectation $\tilde{\mathbb{E}}$ to get a vector $x$ such
that $T(x,x,x) \approx 1$; in particular $x$ will be close to one of the components $a_i$'s.

In Section 4 we will prove the first part of the observation. In particular we show that even though
we are maximizing over pseudo-expectations $\tilde{\mathbb{E}}$ (instead of real distributions over $x$), we can still
guarantee the maximum value $\tilde{\mathbb{E}}[T(x,x,x)]$ is at most $1 + 1/\log n$ with high probability.

In Section 5 we give algorithms for finding a component given a pseudo-expectation $\tilde{\mathbb{E}}$ with
$\tilde{\mathbb{E}}[T(x,x,x)] \approx 1$. The main idea of our algorithm is similar to the robust tensor decomposition
algorithm in [BKS15]: first we show there must be a component $a_i$ such that $\tilde{\mathbb{E}}[\langle a_i, x \rangle^d]$ is large for
a large $d$, then we use ideas in [BKS15] to find the component $a_i$.

4 Certifying Injective Norm 

In this section, we give Algorithm 1 based on the SoS hierarchy that certifies the injective norm of a
random tensor. In particular, we will prove Theorem 1.3, which we restate in more detail here.

Theorem 4.1. Algorithm 1 always returns NO when $\|T\|_{\mathrm{inj}} > 1 + 1/\log n$. When $T \sim \mathcal{D}_{m,n}$ and
$m \ll n^{3/2}$, Algorithm 1 returns YES with high probability over the randomness of $T$. Further, the
same guarantee holds given an approximation $\tilde{T}$ where, if $M \in \mathbb{R}^{n^2 \times n}$ is an unfolding of $\tilde{T} - T$,
$\|M\| \le 1/(2 \log n)$.

When $\|T\|_{\mathrm{inj}} > 1 + 1/\log n$, then by definition there must be a vector $x^*$ that satisfies $\|x^*\| = 1$
and $T(x^*, x^*, x^*) > 1 + 1/\log n$. We can take $\tilde{\mathbb{E}}$ to be the expectation of a distribution that is only
supported on $x^*$ (i.e. with probability 1, $x = x^*$). Clearly this pseudo-expectation is valid, and
OPT will be larger than $1 + 1/\log n$. Hence the algorithm returns NO.

Algorithm 1 Certifying Injective Norm
Input: A random 3-tensor $T$
Output: If $\|T\|_{\mathrm{inj}} > 1 + 1/\log n$, return NO. If $T \sim \mathcal{D}_{m,n}$ ($m \ll n^{3/2}$) then w.h.p. return YES.

Solve the following optimization and obtain optimal value OPT

    Maximize    $\tilde{\mathbb{E}}[T(x,x,x)]$
    Subject to  $\tilde{\mathbb{E}}$ is a degree-12 pseudo-expectation            (1)
                that satisfies $\{r(x) = \|x\|^2 - 1 = 0\}$                     (2)

return YES if OPT $\le 1 + 1/\log n$ and NO otherwise.

For a random tensor $T$, we hope to show that with high probability, the fact that the tensor norm is less than
$1 + 1/\log n$ can be proved via SoS.

Theorem 4.2. With high probability over the randomness of the tensor $T$, for $r(x) = \|x\|^2 - 1$,

$$T(x,x,x) \preceq_{r,12} 1 + \tilde{O}(m/n^{3/2}). \tag{3}$$

Note that taking the pseudo-expectation $\tilde{\mathbb{E}}$ on both sides of (3), for any degree-12
pseudo-expectation $\tilde{\mathbb{E}}$ that is consistent with $r(x)$,

$$\tilde{\mathbb{E}}[T(x, x, x)] \le 1 + \tilde{O}(m/n^{3/2}).$$

That is, when $m \ll n^{3/2}$, the objective value of the convex program in Algorithm 1 is less than
$1 + 1/\log n$ with high probability for a random tensor.

Now we need to prove Theorem 4.2. We first use the Cauchy-Schwarz inequality to transform the LHS
of (3) to a degree-4 polynomial, which would then correspond to 4th order tensors and enable
non-trivial unfoldings.

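Claim 1.

$$[T(x,x,x)]^2 \preceq_{r,12} \underbrace{\sum_{i=1}^{m} \langle a_i, x \rangle^4}_{\text{2-to-4 norm}} + \underbrace{\sum_{i \ne j} \langle a_i, a_j \rangle \langle a_i, x \rangle^2 \langle a_j, x \rangle^2}_{:= p(x)}. \tag{4}$$

Proof. This is a direct application of the Cauchy-Schwarz inequality:

$$(T \cdot x^{\otimes 3})^2 = \Big( \sum_{i=1}^{m} \langle a_i, x \rangle^3 \Big)^2 = \Big\langle \sum_{i=1}^{m} \langle a_i, x \rangle^2 a_i,\; x \Big\rangle^2 \preceq \Big\| \sum_{i=1}^{m} \langle a_i, x \rangle^2 a_i \Big\|^2 \|x\|^2 \preceq_r \Big\| \sum_{i=1}^{m} \langle a_i, x \rangle^2 a_i \Big\|^2.$$

Expanding this quantity, and using the fact that $\|a_i\| = 1$, we get

$$\Big\| \sum_{i=1}^{m} \langle a_i, x \rangle^2 a_i \Big\|^2 = \sum_{i=1}^{m} \langle a_i, x \rangle^4 + \sum_{i \ne j} \langle a_i, a_j \rangle \langle a_i, x \rangle^2 \langle a_j, x \rangle^2. \tag{5}$$

□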
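Inequality (4) holds pointwise for every unit vector $x$ once the components have unit norm, so it can be sanity-checked numerically. The sketch below is only such a check, not an SoS proof; the sizes, seed, and helper names are illustrative assumptions.

```python
import random
from math import sqrt

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def claim1_sides(components, x):
    """Return (lhs, rhs) of inequality (4) for unit-norm components and unit x."""
    inner = [dot(a, x) for a in components]
    lhs = sum(c ** 3 for c in inner) ** 2                 # [T(x,x,x)]^2
    two_to_four = sum(c ** 4 for c in inner)              # first term of (4)
    m = len(components)
    p = sum(dot(components[i], components[j]) * inner[i] ** 2 * inner[j] ** 2
            for i in range(m) for j in range(m) if i != j)  # second term p(x)
    return lhs, two_to_four + p

rng = random.Random(1)
n, m = 16, 24                                             # illustrative sizes
scale = 1.0 / sqrt(n)
components = [[rng.choice((-scale, scale)) for _ in range(n)] for _ in range(m)]

g = [rng.gauss(0.0, 1.0) for _ in range(n)]
g_norm = sqrt(dot(g, g))
x = [gi / g_norm for gi in g]                             # random unit vector

lhs, rhs = claim1_sides(components, x)
print(lhs <= rhs + 1e-9)  # True, by Cauchy-Schwarz
```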
The first term is closely related to the 2-to-4 norm of random matrices: let $A \in \mathbb{R}^{m \times n}$ be a matrix
whose rows are equal to the $a_i$'s, then $\|A\|_{2 \to 4} = \sup_{\|x\|=1} \|Ax\|_4$. Clearly, $\|A\|_{2 \to 4}^4 = \sup_{\|x\|=1} \sum_{i=1}^m \langle a_i, x \rangle^4$
is the maximum value of the first term. This is considered in [BBH+12], where they gave a SoS
proof that when $m \ll n^2$ the first term is bounded by $O(1)$. Here we are in the regime $m \ll n^{3/2}$,
so we can improve the bound to $1 + o(1)$ (the proof is deferred to Appendix A.1):

Lemma 2. With high probability over the randomness of the $a_i$'s,

$$\sum_{i=1}^{m} \langle a_i, x \rangle^4 \preceq_{r,12} 1 + \tilde{O}(m/n^{3/2}). \tag{6}$$

The harder part of the proof is to deal with the second term $p(x)$ on the RHS of (4). The naive
idea would be to let $y = x^{\otimes 2}$ and view $p(x)$ as a degree-2 polynomial of $y$,

$$q(y) = \sum_{i \ne j} \langle a_i, a_j \rangle \langle a_i \otimes a_i, y \rangle \langle a_j \otimes a_j, y \rangle = y^\top N y. \tag{7}$$

Here $N$ is an $n^2$ by $n^2$ random matrix that depends on the $a_i$'s. Suppose $N$ has spectral norm less than
$o(1)$; then we have $y^\top N y \preceq \|N\| \|y\|^2$, and by replacing $y = x \otimes x$ we obtain $p(x) = q(x \otimes x) \preceq o(1)$.
However, in our case the matrix $N$ has spectral norm much larger than $o(1)$.

Our key insight is that there are different ways to unfold $p(x)$ into a degree-2 polynomial.
In particular, we use the following way of unfolding:

$$q'(y) = \sum_{i \ne j} \langle a_i, a_j \rangle \langle a_i \otimes a_j, y \rangle \langle a_i \otimes a_j, y \rangle = y^\top M y, \tag{8}$$

where $M$ is the $n^2$ by $n^2$ matrix that encodes the coefficients of $q'(y)$,

$$M = \sum_{i \ne j} \langle a_i, a_j \rangle (a_i \otimes a_j)(a_i \otimes a_j)^\top.$$

It turns out that $q'(y)$ still has the property that $q'(x \otimes x) = p(x)$. The matrix $M$ has a much
better spectral norm bound, which leads us to the bound for $p(x)$.
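Both unfoldings agree with $p(x)$ on vectors of the form $y = x \otimes x$, because $\langle a_i \otimes a_j, x \otimes x \rangle = \langle a_i, x \rangle \langle a_j, x \rangle$. The sketch below checks this numerically by evaluating the quadratic forms directly, without building $N$ or $M$; helper names, sizes, and the seed are illustrative assumptions.

```python
import random
from math import sqrt

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def kron_vec(u, v):
    """Flattened Kronecker product of two vectors (length len(u) * len(v))."""
    return [ui * vj for ui in u for vj in v]

def off_diagonal_pairs(m):
    return [(i, j) for i in range(m) for j in range(m) if i != j]

def p_poly(components, x):
    return sum(dot(components[i], components[j])
               * dot(components[i], x) ** 2 * dot(components[j], x) ** 2
               for i, j in off_diagonal_pairs(len(components)))

def q_naive(components, y):
    """y^T N y from equation (7)."""
    return sum(dot(components[i], components[j])
               * dot(kron_vec(components[i], components[i]), y)
               * dot(kron_vec(components[j], components[j]), y)
               for i, j in off_diagonal_pairs(len(components)))

def q_alternative(components, y):
    """y^T M y from equation (8)."""
    return sum(dot(components[i], components[j])
               * dot(kron_vec(components[i], components[j]), y) ** 2
               for i, j in off_diagonal_pairs(len(components)))

rng = random.Random(2)
n, m = 4, 6                                   # illustrative sizes
scale = 1.0 / sqrt(n)
components = [[rng.choice((-scale, scale)) for _ in range(n)] for _ in range(m)]
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = kron_vec(x, x)

print(abs(q_naive(components, y) - p_poly(components, x)) < 1e-9)        # True
print(abs(q_alternative(components, y) - p_poly(components, x)) < 1e-9)  # True
```

The two quadratic forms generally differ on vectors $y$ that are not of the form $x \otimes x$, which is what allows their matrices to have very different spectral norms.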

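Lemma 3. When $m \ll n^{3/2}$, the matrix $M = \sum_{i \ne j} \langle a_i, a_j \rangle (a_i \otimes a_j)(a_i \otimes a_j)^\top$ has spectral norm at
most $\tilde{O}(m/n^{3/2})$ and as a direct consequence,

$$p(x) \preceq_{r,4} \tilde{O}(m/n^{3/2}).$$

First we give an informal and suboptimal bound for intuition. Let $B$ be the $n^2 \times m^2$ matrix
whose $(i,j)$-column ($i,j \in [m]$) is $a_i \otimes a_j$ (viewed as an $n^2$ dimensional vector). Then $M$ can be
written as $M = B \operatorname{diag}(\langle a_i, a_j \rangle)_{i \ne j} B^\top$. Note that $B$ can also be written as $A \otimes A$, where $\otimes$ is the
Kronecker product of two matrices, so we have $\|B\| = \|A\|^2 \le \tilde{O}(m/n)$. Then we can bound the norm
of $M$ by $\|M\| \le \|B\| \, \|\operatorname{diag}(\langle a_i, a_j \rangle)\| \, \|B\| \le \tilde{O}(m/n) \cdot \max_{i \ne j} |\langle a_i, a_j \rangle| \cdot \tilde{O}(m/n) \le \tilde{O}(m^2/n^{2.5})$, where we used
the incoherence of the $a_i$'s, that is, $|\langle a_i, a_j \rangle| \le \tilde{O}(1/\sqrt{n})$. This will only be $o(1)$ when $m \ll n^{1.25}$.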

Intuitively, this proof is not tight because we ignored potential cancellation caused by the
randomness of $\langle a_i, a_j \rangle$. Note that the $\langle a_i, a_j \rangle$ have expectation 0, but we treated them all as positive
$\tilde{O}(1/\sqrt{n})$. If we assume that the $\langle a_i, a_j \rangle$'s are independent $\pm \tilde{O}(1/\sqrt{n})$, then $M = \sum_{i \ne j} \langle a_i, a_j \rangle (a_i \otimes a_j)(a_i \otimes a_j)^\top$
would be a sum of PSD matrices with random weights and we can apply more standard matrix
concentration bounds to make sure cancellations happen.

However, the $\langle a_i, a_j \rangle$ are of course not independent and our key idea is to decouple the randomness
of $\langle a_i, a_j \rangle$.

Proof. (Sketch) We first replace the vectors $a_i$'s with $\sigma_i a_i$, where $\sigma_i$ is a random $\pm 1$ variable. This is
OK because the distributions of $a_i$ and $\sigma_i a_i$ are the same. Now we first sample the $a_i$'s; conditioned
on the samples, $M = \sum_{i \ne j} \sigma_i \sigma_j \langle a_i, a_j \rangle (a_i \otimes a_j)(a_i \otimes a_j)^\top$ (where only the $\sigma_i$'s are still random). Now
since the vectors $a_i$'s are all fixed, the correlation between different terms only depends on the scalar
variables $\sigma_i \sigma_j$, and we never use the term $\sigma_i^2$ (because $i \ne j$).

By a result of [PMS95], in this case we can decouple the product $\sigma_i \sigma_j$. In particular, in order
to prove concentration properties for $M$, it suffices to prove concentration for a different matrix
$\sum_{i \ne j} \sigma_i \tau_j \langle a_i, a_j \rangle (a_i \otimes a_j)(a_i \otimes a_j)^\top$. Here $\tau \in \{\pm 1\}^m$ is an independent copy of $\sigma$. In this way
we have decoupled the randomness in $\sigma_i$ and $\tau_j$, and the rest of the Lemma follows from a careful
matrix concentration analysis. □
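The symmetrization step in the sketch above rests on a simple algebraic fact: flipping the signs of the components changes each term of $M$ only through the scalar $\sigma_i \sigma_j$. The code below checks this on a quadratic form $y^\top M y$ and also evaluates the decoupled form with an independent sign vector $\tau$; all names, sizes, and the seed are illustrative assumptions, and no concentration statement is verified.

```python
import random
from math import sqrt

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def kron_vec(u, v):
    return [ui * vj for ui in u for vj in v]

def signed_quadratic_form(components, left_signs, right_signs, y):
    """y^T ( sum_{i != j} left_i * right_j * Q_ij ) y with
    Q_ij = <a_i, a_j> (a_i (x) a_j)(a_i (x) a_j)^T."""
    m = len(components)
    return sum(left_signs[i] * right_signs[j]
               * dot(components[i], components[j])
               * dot(kron_vec(components[i], components[j]), y) ** 2
               for i in range(m) for j in range(m) if i != j)

rng = random.Random(3)
n, m = 4, 5                                   # illustrative sizes
scale = 1.0 / sqrt(n)
components = [[rng.choice((-scale, scale)) for _ in range(n)] for _ in range(m)]
sigma = [rng.choice((-1, 1)) for _ in range(m)]
tau = [rng.choice((-1, 1)) for _ in range(m)]  # independent copy used for decoupling
y = [rng.gauss(0.0, 1.0) for _ in range(n * n)]

ones = [1] * m
flipped = [[s * c for c in a] for s, a in zip(sigma, components)]

form_of_flipped = signed_quadratic_form(flipped, ones, ones, y)   # M built from sigma_i * a_i
form_with_signs = signed_quadratic_form(components, sigma, sigma, y)
decoupled_form = signed_quadratic_form(components, sigma, tau, y)

print(abs(form_of_flipped - form_with_signs) < 1e-9)  # True
```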

We give the full proof of Lemma 3 in Appendix A.2.

Proof Sketch of Main Theorem. Theorem 4.2 follows directly from Lemma 2 and Lemma 3.
Using Lemma 1 we get the main Theorem 4.1 in the noiseless case. When there is noise, since we
have bounds on the spectral norm of an unfolding of $\tilde{T} - T$, it implies (via an SoS version of the
spectral norm bound) that $[\tilde{T} - T](x, x, x) \preceq_{r,12} 1/(2 \log n)$. It is easy to verify that
$\tilde{T}(x,x,x) = T(x,x,x) + [\tilde{T} - T](x,x,x) \preceq_{r,12} 1 + 1/\log n$, so
Theorem 4.1 still holds. We give more details in Appendix A.3.

5 Quasi-polynomial Time Algorithm for Tensor Decomposition 

In this section we give a quasi-polynomial time algorithm for decomposing random 3rd order tensors
in distribution $\mathcal{D}_{m,n}$. In particular, we prove Theorem 1.1, which we restate with more details below:

Theorem 5.1. Let $T$ be a tensor chosen from $\mathcal{D}_{m,n}$. When $m \ll n^{3/2}$, with high probability over
the randomness of $T$, Algorithm 2 returns $\{\hat{a}_i\}$ that is 0.1-close to $\{a_i\}$ in time $n^{O(\log n)}$. Further,
the same guarantee holds given an approximation $\tilde{T}$ where, if $M \in \mathbb{R}^{n^2 \times n}$ is an unfolding of $\tilde{T} - T$,
$\|M\| \le 1/(10 \log n)$.

A key component of our algorithm is a way of sampling pseudo-distributions given in [BKS15]:

Theorem 5.2 (Theorem 5.1 in [BKS15]). For every $k \ge 0$, there exists a randomized algorithm with
running time $n^{O(k)}$ and success probability $2^{-k/\operatorname{poly}(\epsilon)}$ for the following problem: Given a degree-$k$
pseudo-distribution $\{u\}$ over $\mathbb{R}^n$ that satisfies the polynomial constraint $\|u\|^2 = 1$ and the condition
$\tilde{\mathbb{E}}[\langle c, u \rangle^k] \ge e^{-\epsilon k}$ for some unit vector $c \in \mathbb{R}^n$, output a unit vector $c' \in \mathbb{R}^n$ with $\langle c, c' \rangle \ge 1 - O(\epsilon)$.

The basic idea of Algorithm 2 is as follows. At each iteration, the algorithm tries to find a new
vector $\hat{a}_i$. As we discussed in Section 3, in order to find a vector close to $a_i$ it finds a vector $x$ with
a large $T(x, x, x)$ value. Moreover, it enforces that the new vector is different from all previously found
vectors by the set of polynomial inequalities $\{\langle s, x \rangle^2 \le 1/8 : s \in S\}$. Intuitively, if we haven't found all
of the vectors $a_i$'s, any of the remaining $a_i$'s will satisfy the set of constraints $\{\langle s, x \rangle^2 \le 1/8 : s \in S\}$
and $T(x,x,x) \ge 1 - 1/\log n$. Therefore each time we can find a valid pseudo-expectation $\tilde{\mathbb{E}}$.
What we need to prove is that any pseudo-expectation $\tilde{\mathbb{E}}$ we find always satisfies $\tilde{\mathbb{E}}[\langle a_i, x \rangle^k] \ge e^{-\epsilon k}$
for some $a_i$, where $k = O((\log n)/\epsilon)$ and $\epsilon$ is a small enough constant. Then by Theorem 5.2 we can
obtain a new vector that is $O(\epsilon)$-close to one of the $a_i$'s. We formalize this in the following lemma:


Algorithm 2 Overcomplete Random 3-Tensor Decomposition

Input: Random 3-tensor $T = \sum_{i=1}^m a_i^{\otimes 3} \sim \mathcal{D}_{m,n}$.
Output: $\hat{a}_1, \ldots, \hat{a}_m \in \mathbb{R}^n$ s.t. $\{\hat{a}_i\}$ is 0.1-close to $\{a_i\}$

1: $S \leftarrow \emptyset$
2: repeat
3:    Use semidefinite programming to find a degree $k = O(\log n)$ pseudo-expectation $\tilde{\mathbb{E}}$ that
      satisfies the constraints $\{T(x,x,x) \ge 1 - 1/\log n, \|x\|^2 = 1\}$ and $\{\langle s, x \rangle^2 \le 1/8 : s \in S\}$.
4:    Run the algorithm in Theorem 5.1 of [BKS15] repeatedly with input $\tilde{\mathbb{E}}$ and obtain a
      vector $c$ such that $T(c,c,c) \ge 0.99$.
5:    Add vector $c$ to $S$.
6: until $|S| = m$
7: return $\{\hat{a}_i\} = S$.


Lemma 4. When $T$ is chosen from $\mathcal{D}_{m,n}$ where $m \ll n^{3/2}$, with high probability over the
randomness of $T$, the pseudo-expectation found in Step 3 of Algorithm 2 satisfies the following: there exists
an $a_i$ such that $\tilde{\mathbb{E}}[\langle a_i, x \rangle^k] \ge e^{-\epsilon k}$ for a sufficiently small constant $\epsilon$ (where the pseudo-expectation
has degree $4k$ and $k = O((\log n)/\epsilon)$). In particular, applying Theorem 5.2 repeatedly
will give a vector $c$ such that $\langle c, a_i \rangle \ge 1 - O(\epsilon)$.

The main intuition is to use the Cauchy-Schwarz and Hölder inequalities (as we did in
Claim 1) to raise the power in the sum $\sum_{i=1}^m \langle a_i, x \rangle^d$ (we start with $d = 3$ and hope to get to $d = k$).
When the degree is high enough we can afford to do an averaging argument and lose a factor of $m$
to go from the sum to an individual vector, because $e^{\epsilon k} = \operatorname{poly}(m)$. The detailed proof is given in
Appendix B.1.

Now we are ready to prove Theorem 5.1.

Proof. (Sketch) We prove Theorem 5.1 by induction. Suppose $S$ already contains a set of vectors
$\hat{a}_i$'s, where for each $\hat{a}_i$ there is a corresponding $a_j$ that satisfies $\|\hat{a}_i - a_j\| \le 0.1$. We would like
to show that with high probability in the next iteration, the algorithm finds a new component that is
different from all the previously found $a_j$'s.

In order to do that, we need to show the following:

1. The SDP in Step 3 of Algorithm 2 is feasible and gives a valid pseudo-expectation.

2. For any valid pseudo-expectation, with high probability we get a unit vector $c$ that satisfies
$T(c, c, c) \ge 0.99$, and $c$ is far from all the previously found $a_j$'s.

3. For any unit vector $c$ such that $T(c,c,c) \ge 0.99$, there must be a component $a_i$ such that
$\|a_i - c\| \le 0.1$.

Among these three steps, Step 1 follows because we can take $\tilde{\mathbb{E}}$ to be the expectation of a true
distribution: $x = a_i$ with probability 1 for some unfound $a_i$. Step 2 is basically Lemma 4: when
we choose $\epsilon$ to be a small enough constant, it is easy to prove that all the vectors that satisfy
$\langle c, a_i \rangle \ge 1 - O(\epsilon)$ must satisfy $T(c,c,c) \ge 0.99$. Step 3 is the second part of our observation in
Section 3, which we prove in the appendix. □

The details of this proof can be found in Appendix B.2.


6 Conclusion 

In this paper we give the first algorithm that can decompose an overcomplete 3rd order tensor when
the rank $m$ is almost $n^{3/2}$, which matches the bounds for even order tensors. Our argument is
based on a special unfolding of the tensor and a decoupling argument for matrix concentration. We
feel such techniques can be useful in other settings.

Tensor decompositions are widely applied in machine learning for learning latent variable models.
Although the SoS based algorithm has poor dependency on the accuracy $\epsilon$, in the case of
tensor decomposition we can actually use SoS as an initialization algorithm. We hope such ideas
can help solve more problems in machine learning.

Acknowledgment We thank Anima Anandkumar, Boaz Barak, Johnathan Kelner, David Steurer, 
Venkatesan Guruswami for helpful discussions at various stages of this work. 
A Omitted Proofs in Section 4

A.1 Proof of Lemma 2

We first restate the lemma here. 

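Lemma 5. With high probability over the randomness of the $a_i$'s,

$$\sum_{i=1}^{m} \langle a_i, x \rangle^4 \preceq_{r,12} 1 + \tilde{O}(m/n^{3/2}). \tag{9}$$

Recall that [BBH+12] showed that when $m \ll n^2$,

$$\sum_{i=1}^{m} \langle a_i, x \rangle^4 \preceq O(1). \tag{10}$$

Here in order to improve this bound, we consider the square of the LHS of (6) and apply
Cauchy-Schwarz (similar to Claim 1),

$$\begin{aligned}
\Big( \sum_{i=1}^{m} \langle a_i, x \rangle^4 \Big)^2 &= \Big\langle \sum_{i=1}^{m} \langle a_i, x \rangle^3 a_i,\; x \Big\rangle^2 \\
&\preceq \Big\| \sum_{i=1}^{m} \langle a_i, x \rangle^3 a_i \Big\|^2 \|x\|^2 && \text{by Cauchy-Schwarz} \\
&\preceq_r \sum_{i=1}^{m} \langle a_i, x \rangle^6 + \sum_{i \ne j} \langle a_i, a_j \rangle \langle a_i, x \rangle^3 \langle a_j, x \rangle^3.
\end{aligned} \tag{11}$$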


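We will bound the first term of (11) by $1 + o(1)$. We simply let $y = x^{\otimes 3}$ and let $B$ be the matrix
whose $i$-th row is $a_i^{\otimes 3}$. Then $f(y) = \|By\|^2$ has the property that $f(x^{\otimes 3}) = \sum_{i=1}^m \langle a_i, x \rangle^6$. Therefore
it suffices to prove that $f(y) \preceq (1 + o(1)) \|y\|^2$, or equivalently $\|B\| \le 1 + o(1)$.

Consider the matrix $BB^\top$. It is an $m$ by $m$ matrix with diagonal entries 1 and off-diagonal
entries of the form $\langle a_i^{\otimes 3}, a_j^{\otimes 3} \rangle = \langle a_i, a_j \rangle^3$. By the incoherence of the $a_i$'s, we have $|\langle a_i, a_j \rangle|^3 \le \tilde{O}(1/n^{3/2})$.
Then by the Gershgorin disk theorem, we have $\|BB^\top\| \le 1 + \tilde{O}(m/n^{3/2}) = 1 + \delta$. It follows that
$\|B\| \le 1 + \tilde{O}(m/n^{3/2})$. Therefore,

$$\sum_{i=1}^{m} \langle a_i, x \rangle^6 = \|B x^{\otimes 3}\|^2 \preceq (1 + \tilde{O}(m/n^{3/2})) \|x^{\otimes 3}\|^2 \preceq_r 1 + \tilde{O}(m/n^{3/2}). \tag{12}$$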
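The Gershgorin step can be illustrated numerically: the Gram matrix $BB^\top$ has entries $\langle a_i, a_j \rangle^3$, and its largest eigenvalue is at most the largest row sum of absolute values. The sketch below computes that row-sum bound and checks it against Rayleigh quotients of a few test vectors; the sizes, seed, and function names are illustrative assumptions.

```python
import random
from math import sqrt

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def gram_of_cubes(components):
    """Gram matrix B B^T of the flattened a_i^{(x)3}; entry (i, j) is <a_i, a_j>^3."""
    return [[dot(a, b) ** 3 for b in components] for a in components]

def gershgorin_upper_bound(G):
    """max_i ( G[i][i] + sum_{j != i} |G[i][j]| ), an upper bound on the top eigenvalue."""
    size = len(G)
    return max(G[i][i] + sum(abs(G[i][j]) for j in range(size) if j != i)
               for i in range(size))

def rayleigh_quotient(G, v):
    Gv = [dot(row, v) for row in G]
    return dot(v, Gv) / dot(v, v)

rng = random.Random(4)
n, m = 16, 20                                  # illustrative sizes
scale = 1.0 / sqrt(n)
components = [[rng.choice((-scale, scale)) for _ in range(n)] for _ in range(m)]

G = gram_of_cubes(components)
bound = gershgorin_upper_bound(G)
test_vectors = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(5)]
print(all(rayleigh_quotient(G, v) <= bound + 1e-9 for v in test_vectors))  # True
```

Here `bound - 1` plays the role of $\delta$ in the argument above; how small it is depends on $m$, $n$, and the sampled components.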


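For the second term of (11), we apply Cauchy-Schwarz again:

$$\begin{aligned}
\Big( \sum_{i \ne j} \langle a_i, a_j \rangle \langle a_i, x \rangle^3 \langle a_j, x \rangle^3 \Big)^2 &\preceq \Big( \sum_{i \ne j} \langle a_i, a_j \rangle^2 \langle a_i, x \rangle^2 \langle a_j, x \rangle^2 \Big) \Big( \sum_{i \ne j} \langle a_i, x \rangle^4 \langle a_j, x \rangle^4 \Big) \\
&\preceq \tilde{O}(1/n) \Big( \sum_{i} \langle a_i, x \rangle^2 \Big) \Big( \sum_{j} \langle a_j, x \rangle^2 \Big) \Big( \sum_{i} \langle a_i, x \rangle^4 \Big) \Big( \sum_{j} \langle a_j, x \rangle^4 \Big).
\end{aligned} \tag{13}$$

Note that the matrix $A = [a_1 | \ldots | a_m]$ has spectral norm bound $\|A\| \le \tilde{O}(\sqrt{m/n})$ and therefore

$$\sum_{i} \langle a_i, x \rangle^2 = \|A^\top x\|^2 \preceq \tilde{O}(m/n) \|x\|^2 \preceq_r \tilde{O}(m/n).$$

Then using Equation (10) and the equation above, we have

$$\text{RHS of (13)} \preceq \tilde{O}\Big( \frac{1}{n} \cdot \frac{m}{n} \cdot \frac{m}{n} \Big) \cdot O(1) \cdot O(1) \le \tilde{O}(m^2/n^3). \tag{14}$$

Then by (13) and (14) and Lemma 13 we have that

$$\sum_{i \ne j} \langle a_i, a_j \rangle \langle a_i, x \rangle^3 \langle a_j, x \rangle^3 \preceq \tilde{O}(m/n^{3/2}). \tag{15}$$

Hence, combining equations (11), (12) and (15) we have that

$$\begin{aligned}
\Big( \sum_{i=1}^{m} \langle a_i, x \rangle^4 \Big)^2 &\preceq_r \sum_{i=1}^{m} \langle a_i, x \rangle^6 + \sum_{i \ne j} \langle a_i, a_j \rangle \langle a_i, x \rangle^3 \langle a_j, x \rangle^3 \\
&\preceq 1 + \tilde{O}(m/n^{3/2}) + \tilde{O}(m/n^{3/2}) = 1 + \tilde{O}(m/n^{3/2}).
\end{aligned} \tag{16}$$

Using Lemma 13 again, we complete the proof of Lemma 5.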

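A.2 Proof of Lemma 3

We first restate the lemma:

Lemma 6. When $m \ll n^{3/2}$, the matrix $M = \sum_{i\ne j} \langle a_i,a_j\rangle (a_i \otimes a_j)(a_i \otimes a_j)^\top$ has spectral norm at most $O(m/n^{3/2})$, and as a direct consequence,

\[ p(x) \preceq_{R,4} O(m/n^{3/2}). \]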

Proof. As suggested in the proof sketch, we first use a simple symmetrization which allows us to focus on the randomness of the signs of $\langle a_i,a_j\rangle$. For simplicity of notation, let $Q_{ij} := \langle a_i,a_j\rangle (a_i \otimes a_j)(a_i \otimes a_j)^\top$. Let $\sigma \in \{\pm 1\}^m$ be a uniformly random $\pm 1$ vector and define $M'$ as

\[ M' = \sum_{i\ne j} \sigma_i \sigma_j Q_{ij}. \]

We claim that $M'$ has the same distribution as $M$, since $a_i$ has the same distribution as $\sigma_i a_i$. From now on we condition on the event that the $a_i$'s have the incoherence property and low spectral norm, that is, $|\langle a_i,a_j\rangle| \le \tilde{O}(1/\sqrt{n})$ for all $i \ne j$ and $\|A\| = \|[a_1 | a_2 | \dots | a_m]\| \le O(\sqrt{m/n})$, and we focus only on the randomness of $\sigma$. Ideally we want to write $M'$ as a sum of independent random matrices so that we can apply the matrix Bernstein inequality. However, the random coefficients are now $\sigma_i \sigma_j$, and they are not independent of each other.
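The symmetrization claim rests on the identity that replacing $a_i$ by $\sigma_i a_i$ and $a_j$ by $\sigma_j a_j$ multiplies $Q_{ij}$ by exactly $\sigma_i \sigma_j$. A small Python sketch of that check follows; the dimension n = 3, the seed, and the helper names are illustrative assumptions.

```python
import random

def inner(u, v):
    return sum(p * q for p, q in zip(u, v))

def kron_vec(u, v):
    # Kronecker (tensor) product of two vectors, flattened.
    return [p * q for p in u for q in v]

def q_matrix(ai, aj):
    # Q_ij = <a_i, a_j> (a_i tensor a_j)(a_i tensor a_j)^T
    w = kron_vec(ai, aj)
    c = inner(ai, aj)
    return [[c * p * q for q in w] for p in w]

rng = random.Random(1)
n = 3  # illustrative dimension
ai = [rng.gauss(0.0, 1.0) for _ in range(n)]
aj = [rng.gauss(0.0, 1.0) for _ in range(n)]
base = q_matrix(ai, aj)

max_err = 0.0
for si in (1, -1):
    for sj in (1, -1):
        flipped = q_matrix([si * t for t in ai], [sj * t for t in aj])
        for row_f, row_b in zip(flipped, base):
            for f, b in zip(row_f, row_b):
                # Only the inner product picks up the sign sigma_i * sigma_j;
                # the rank-one factor is invariant under sign flips.
                max_err = max(max_err, abs(f - si * sj * b))
print(max_err)
```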

A key observation here is that the sum is only over the indices $(i,j)$ with $i \ne j$, therefore we can use Theorem 1 of [PMS95] (restated as Theorem C.1 at the end) to decouple the correlation first.
Theorem C.1 basically says that the concentration of a sum of the form $\sum_{i\ne j} f_{ij}(X_i, X_j)$ is, up to a constant factor, similar to the concentration of the sum $\sum_{i\ne j} f_{ij}(X_i, Y_j)$, where $Y_i$ is an independent copy of $X_i$. Applying the theorem to our situation, we have that there exists an absolute constant $C$ such that

\[ \Pr[\|M'\| \ge t] \le C \Pr[\|M''\| \ge t/C] \tag{17} \]

where

\[ M'' := \sum_{i\ne j} \sigma_i \tau_j Q_{ij}, \]

and $\sigma, \tau$ are independently uniform over $\{-1,+1\}^m$.

Now it suffices to bound the norm of $M''$. We proceed by rewriting $M''$ as

\[ M'' = \sum_i \sigma_i \sum_{j\ne i} \tau_j Q_{ij} := \sum_i \sigma_i T_i, \]

where

\[ T_i := \sum_{j\ne i} \tau_j Q_{ij}. \tag{18} \]

We study the properties of $T_i$ first.

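Claim 2. With high probability over the randomness of the $\tau_j$'s, for all $i$, $T_i \preceq O(\sqrt{m}/n)\,(a_i a_i^\top) \otimes I$.

Proof. Recall that $Q_{ij} = \langle a_i,a_j\rangle (a_i \otimes a_j)(a_i \otimes a_j)^\top$. In the definition (18) of $T_i$, the index $i$ is fixed and we take the sum over $j$. Therefore it will be convenient to write $Q_{ij}$ as $Q_{ij} = \langle a_i,a_j\rangle (a_i a_i^\top) \otimes (a_j a_j^\top)$, where $\otimes$ is the Kronecker product between matrices. Then $T_i$ can be written as

\[ T_i = (a_i a_i^\top) \otimes \Big(\sum_{j\ne i} \tau_j \langle a_i,a_j\rangle a_j a_j^\top\Big). \]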

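We apply the matrix Bernstein inequality (Theorem C.2) to the right factor. The matrix Bernstein bound requires a spectral norm bound for the individual matrices, and a variance bound.

For the spectral norm of the individual matrices, we check that $\|\tau_j \langle a_i,a_j\rangle a_j a_j^\top\| \le \tilde{O}(1/\sqrt{n})$ (by incoherence). For the variance we know

\[ \Big\| \mathbb{E}\Big[\sum_{j\ne i} \tau_j^2 \langle a_i,a_j\rangle^2 (a_j a_j^\top)^2\Big] \Big\| = \big\| A\, \mathrm{diag}(\langle a_i,a_j\rangle^2)_{j\ne i}\, A^\top \big\| \le O(m/n^2), \]

where we used the spectral norm of $A$ and the fact that $\langle a_i,a_j\rangle^2 \le \tilde{O}(1/n)$.

Therefore by the matrix Bernstein inequality (Theorem C.2) we have that, with high probability over the randomness of $\tau$,

\[ \Big\| \sum_{j\ne i} \tau_j \langle a_i,a_j\rangle a_j a_j^\top \Big\| \le O(\sqrt{m}/n). \]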

Using the fact that for two matrices $P$ and $Q$, if $P \preceq Q$ and $R$ is PSD, then $R \otimes P \preceq R \otimes Q$ (see Claim 3), it follows that

\[ T_i \preceq (a_i a_i^\top) \otimes (O(\sqrt{m}/n) \cdot I). \]

Finally we use a union bound and conclude that with high probability this is true for all $i$. □
Now we can apply matrix Bernstein to the sum $M'' = \sum_i \sigma_i T_i$. The individual spectral norm is bounded by $O(\sqrt{m}/n)$ by Claim 2. The variance is

\[ \Big\| \sum_{i=1}^m T_i^2 \Big\| \le O(m/n^2) \Big\| \sum_{i=1}^m ((a_i a_i^\top) \otimes I)^2 \Big\| = O(m/n^2)\, \|(AA^\top) \otimes I\| = O(m^2/n^3). \]

Using the matrix Bernstein inequality, we know that with high probability $\|M''\| \le O(m/n^{3/2})$.

Using (17), we get that whp $\|M'\| \le O(m/n^{3/2})$. Since $M'$ and $M$ have the same distribution, we conclude that whp $\|M\| \le O(m/n^{3/2})$. □

We complete the proof by providing the following claim about Kronecker products. 

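Claim 3. If $P \preceq Q$ and $R$ is PSD, then $R \otimes P \preceq R \otimes Q$.

Proof. It suffices to prove this when $R = uu^\top$ (as we can always decompose $R$ as a sum of rank-one components). In that case, for any $y \in \mathbb{R}^{n^2}$, we can write $y = u \otimes v + z$ where $z$ is orthogonal to $u \otimes e_i$ for all $i \in [n]$. Now $(R \otimes P)z = 0$ and likewise $(R \otimes Q)z = 0$, therefore

\[ y^\top (R \otimes P) y = (u \otimes v)^\top (R \otimes P)(u \otimes v) = (u^\top R u)(v^\top P v) \le (u^\top R u)(v^\top Q v) = y^\top (R \otimes Q) y. \]

Therefore $R \otimes P \preceq R \otimes Q$. □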
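Claim 3 can be spot-checked numerically. The Python sketch below, under the assumption of small hand-built matrices, forms a PSD matrix $R$, a symmetric $P$, and $Q = P + vv^\top$ (so that $P \preceq Q$ by construction), and evaluates the quadratic form of $R \otimes Q - R \otimes P$ on random vectors. The dimension, seed, and helper names are illustrative only.

```python
import random

def outer(u, v):
    return [[p * q for q in v] for p in u]

def mat_add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def kron(A, B):
    # Kronecker product: block (i, j) of the result equals A[i][j] * B.
    return [[A[i][j] * B[k][l]
             for j in range(len(A[0])) for l in range(len(B[0]))]
            for i in range(len(A)) for k in range(len(B))]

def quad_form(M, y):
    return sum(y[i] * M[i][j] * y[j]
               for i in range(len(y)) for j in range(len(y)))

rng = random.Random(4)
n = 3  # illustrative dimension

def gaussian_vector(size):
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

u, w, v = gaussian_vector(n), gaussian_vector(n), gaussian_vector(n)
R = mat_add(outer(u, u), outer(w, w))        # PSD by construction
S = [gaussian_vector(n) for _ in range(n)]
P = mat_add(S, [list(r) for r in zip(*S)])   # symmetric, possibly indefinite
Q = mat_add(P, outer(v, v))                  # Q - P = v v^T is PSD

RP, RQ = kron(R, P), kron(R, Q)
min_gap = min(quad_form(RQ, y) - quad_form(RP, y)
              for y in (gaussian_vector(n * n) for _ in range(200)))
print(min_gap)
```

A nonnegative minimum (up to floating-point error) is consistent with $R \otimes P \preceq R \otimes Q$; it is a sanity check on random directions, not a proof.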

A.3 Main Theorem for Certifying Injective Norm 

Now we are ready to prove Theorem 4.1.

Theorem A.1. Algorithm 1 always returns NO when $\|T\|_{\mathrm{inj}} > 1 + 1/\log n$. When $T \sim \mathcal{D}_{m,n}$ and $m \ll n^{3/2}$, Algorithm 1 returns YES with high probability over the randomness of $T$. Further, the same guarantee holds given an approximation $\tilde{T}$ where, if $M \in \mathbb{R}^{n \times n^2}$ is an unfolding of $\tilde{T} - T$, $\|M\| \le 1/(2\log n)$.

Proof. We first prove that whenever $\|T\|_{\mathrm{inj}} > 1 + 1/\log n$, the algorithm returns NO. This is because a large injective norm implies there exists a unit vector $x^*$ with $T(x^*,x^*,x^*) > 1 + 1/\log n$. We can construct a pseudo-expectation $\tilde{\mathbb{E}}$ as $\tilde{\mathbb{E}}[p(x)] = p(x^*)$. Clearly this is a valid pseudo-expectation (it is even the expectation of a true distribution: $x = x^*$ with probability 1). Also, we know $\tilde{\mathbb{E}}[T(x,x,x)] = T(x^*,x^*,x^*) > 1 + 1/\log n$, so in particular $\mathrm{OPT} > 1 + 1/\log n$ and the algorithm must return NO.

Next we show the algorithm returns YES with high probability when $T$ is chosen from $\mathcal{D}_{m,n}$. This follows directly from Theorem 4.2, which in turn follows from Lemmas 2 and 3. In particular, we know there is a degree-12 SoS proof that shows $T(x,x,x) \le 1 + O(m/n^{3/2}) \le 1 + 1/(2\log n)$, so by Lemma 1 this must also hold for any pseudo-expectation.

Now suppose we are only given a tensor $\tilde{T}$ such that the unfolding of $\tilde{T} - T$ has spectral norm at most $1/(2\log n)$. Let $M$ be the unfolding of $\tilde{T} - T$, and $y = x \otimes x$. Then by Lemma 12 we know $(x^\top M y)^2 \preceq \|x\|^2 \|M\|^2 \|y\|^2$, which implies (by Lemma 13) $[\tilde{T} - T](x,x,x) = x^\top M y \preceq_{R,12} \|M\| \le 1/(2\log n)$. Combining the two terms we know

\[ \tilde{T}(x,x,x) = T(x,x,x) + [\tilde{T} - T](x,x,x) \preceq_{R,12} 1 + 1/\log n. \]


□ 
B Omitted Proof in Section 5

B.1 Proof of Lemma 4

We first restate the lemma here: 


Lemma 7. When $T$ is chosen from $\mathcal{D}_{m,n}$ where $m \ll n^{3/2}$, with high probability over the randomness of $T$, the pseudo-expectation found in Step 3 of Algorithm 2 satisfies the following: there exists
an Qi such that E[{ai,x)^] > e~'^^ for sufficiently small constant e (where the pseudo-expectation 
has degree Ak and k = 0{{logn)/e)). In particular, applying Theorem, \ 5. lA repeat the algorithm for 
time will give a vector c such that {c,ai) > 1 — 0(e). 

First we will show that for a valid pseudo-expectation, the sums of $\langle a_i,x\rangle^4$ and $\langle a_i,x\rangle^6$ are also bounded. This actually follows directly from the proofs of Lemmas 2 and 3.


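Lemma 8. With high probability over the randomness of $T$, any degree-12 pseudo-expectation $\tilde{\mathbb{E}}$ that satisfies the constraints $\{\|x\|^2 = 1,\ T(x,x,x) \ge 1 - \tau\}$ also satisfies

\[ 1 + \epsilon \ge \tilde{\mathbb{E}}\Big[\sum_{i=1}^m \langle a_i,x\rangle^4\Big] \ge 1 - \epsilon \tag{19} \]

\[ 1 + \epsilon \ge \tilde{\mathbb{E}}\Big[\sum_{i=1}^m \langle a_i,x\rangle^6\Big] \ge 1 - \epsilon \tag{20} \]

for $\epsilon = O(m/n^{3/2}) + O(\tau)$.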


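Proof. We essentially just take pseudo-expectations of the SoS proofs for Lemmas 2 and 3. The upper bounds follow directly by taking pseudo-expectations of equations (9) and (12). For the lower bounds, by taking the pseudo-expectation of the SoS inequality in Lemma 3 we have that $\tilde{\mathbb{E}}[p(x)] \le O(m/n^{3/2})$. Taking the pseudo-expectation of Claim 1 and using the assumption that $\tilde{\mathbb{E}}$ satisfies $T(x,x,x) \ge 1 - \tau$, we have that

\[ 1 - \tau \le \tilde{\mathbb{E}}\big[[T(x,x,x)]^2\big] \le \tilde{\mathbb{E}}\Big[\sum_{i=1}^m \langle a_i,x\rangle^4\Big] + \tilde{\mathbb{E}}[p(x)] \le \tilde{\mathbb{E}}\Big[\sum_{i=1}^m \langle a_i,x\rangle^4\Big] + O(m/n^{3/2}) \tag{21} \]

which implies

\[ \tilde{\mathbb{E}}\Big[\sum_{i=1}^m \langle a_i,x\rangle^4\Big] \ge 1 - \tau - O(m/n^{3/2}). \tag{22} \]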

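For proving the lower bound in (20), we first take the pseudo-expectation of equation (15), which gives

\[ \tilde{\mathbb{E}}\Big[\sum_{i\ne j} \langle a_i,a_j\rangle \langle a_i,x\rangle^3 \langle a_j,x\rangle^3\Big] \le O(m/n^{3/2}). \]

Then taking the pseudo-expectation of equation (16), we obtain that

\[ \tilde{\mathbb{E}}\Big[\Big(\sum_{i=1}^m \langle a_i,x\rangle^4\Big)^2\Big] \le \tilde{\mathbb{E}}\Big[\sum_{i=1}^m \langle a_i,x\rangle^6\Big] + \tilde{\mathbb{E}}\Big[\sum_{i\ne j} \langle a_i,a_j\rangle \langle a_i,x\rangle^3 \langle a_j,x\rangle^3\Big]. \]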
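Note that by equation (22) and Cauchy-Schwarz, we have

\[ \tilde{\mathbb{E}}\Big[\Big(\sum_{i=1}^m \langle a_i,x\rangle^4\Big)^2\Big] \ge \Big(\tilde{\mathbb{E}}\Big[\sum_{i=1}^m \langle a_i,x\rangle^4\Big]\Big)^2 \ge 1 - O(\tau) - O(m/n^{3/2}). \]

Combining the two equations above, we obtain that

\[ \tilde{\mathbb{E}}\Big[\sum_{i=1}^m \langle a_i,x\rangle^6\Big] \ge 1 - O(\tau) - O(m/n^{3/2}). \]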


□ 


Next we are going to prove that $\tilde{\mathbb{E}}$ also satisfies the condition of Theorem 5.2 of [BKS15].

Lemma 9. For $k = O((\log n)/\epsilon)$ with constant $\epsilon < 1$, if $\tilde{\mathbb{E}}$ is a degree-$k$ pseudo-expectation that satisfies equations (19) and (20), then there must exist $i \in [m]$ such that $\tilde{\mathbb{E}}[\langle a_i,x\rangle^{2k}] \ge e^{-(2\epsilon+\delta)k}$ with $\delta = O(m/n^{3/2})$.

Proof. By equation (2.5) of |BKS15] . we the following SoS version of Holder inequality. For any 
integer t, d and k = t{d — 2), 

\\V\\d ITIlfc ITII 

Let Vi = {ai,x)‘^, we have 


/ m \ ^ m / m 

Vi=l / i=l Vi=l 


x) 


(23) 


By Lemma O we have that with high probability over randomness of afs, ^ 1 + 

0{mjn^/‘^), and it follows that 


< m \ 

< (1+ 0(m/n^/^))* 


V 2=1 


By picking d = 3, we have t = k. Taking t = 0(logm/e) and combining equation ([23]l and 
have that 

( m \ ^ m / ra m 

'^{ai,xf I '^{ai,x)‘^^ ■ I x)^ | ^a: (1 + 0(m/n^/^))^ ^(a*, x)^^ 

i=l J i=l Vi=l / i=l 

Applying the pseudo-expectation to both sides, we obtain


(24) 


we 



' / \ h~ 

/ m \ ^ 


~ m 

E 

1 

fR 

P 

_^ 

1_ 

< {l + 0{m/n^/‘^)f 

-1 

(N 

4 

_1 


Note that by Cauchy-Schwarz and equation (|20l) . we have 



m 

k 

/ m \ ^ 

VI 

1 

t-H 

J2{ai,xf 

_2=1 

< E 

f ^(ai,x)M 
(25)


Combining the two equations above, we obtain that for <5 = 


E 


.i=l 


> ( 1 - 5)^(1 


e)^ 


Therefore by averaging argument, there exists i such that 

E[(a*,x)2^] > (1 - 5f/m = ^-^k-Xo^^-ek 

when k > (logm)/e, we have that E[(ai,x)^^] > 

Lemma 7 follows directly from the two lemmas above.


□ 


B.2 Proof of Theorem 5.1

In this section we prove the main theorem in Section 5.

Theorem B.1. Let $T$ be a tensor chosen from $\mathcal{D}_{m,n}$. When $m \ll n^{3/2}$, with high probability over the randomness of $T$, Algorithm 2 returns $\{\tilde{a}_i\}$ that is 0.1-close to $\{a_i\}$ in quasi-polynomial time. Further, the same guarantee holds given an approximation $\tilde{T}$ where, if $M \in \mathbb{R}^{n \times n^2}$ is an unfolding of $\tilde{T} - T$, $\|M\| \le 1/(10\log n)$.

As suggested in the proof sketch, we prove this theorem by induction. The induction hypothesis is that all vectors $s_i \in S$ are 0.1-close (in $\ell_2$ norm) to distinct components $a_i$. We break the proof into three claims:

Claim 4. With high probability over the tensor $T$, suppose all the previously found $s_i$'s are 0.1-close (in $\ell_2$ norm) to some components $a_j$'s, then there exists a pseudo-expectation that satisfies Step 3 in Algorithm 2.

Proof. We first prove that with high probability $T(a_i,a_i,a_i) \ge 1 - 1/\log n$ for all $i$. This is easy because $T(a_i,a_i,a_i) = 1 + \sum_{j\ne i} \langle a_i,a_j\rangle^3$. Conditioned on $a_i$, the values $\langle a_i,a_j\rangle$ are sub-Gaussian random variables with mean 0 and variance $1/n$, so by standard concentration bounds we know that with high probability $\sum_{j\ne i} \langle a_i,a_j\rangle^3 \ge -1/\log n$. We can then take the union bound and conclude $T(a_i,a_i,a_i) \ge 1 - 1/\log n$ for all $i$.

Now for simplicity of notation, assume that $S = \{s_1,\dots,s_t\}$ for some $t < m$, where $s_i$ is 0.1-close to $a_i$. We can construct a pseudo-expectation $\tilde{\mathbb{E}}[p(x)] = p(a_{t+1})$. Clearly this is a valid pseudo-expectation that satisfies $\|x\|^2 = 1$. For the inequality constraints we also know $\langle a_{t+1}, s_i\rangle^2 \le 2(\langle a_{t+1}, a_i\rangle^2 + \langle a_{t+1}, a_i - s_i\rangle^2) \le 1/8$ (where the whole proof only uses Cauchy-Schwarz and $(A+B)^2 \le 2(A^2 + B^2)$, so the proof is SoS). Therefore the system in Step 3 must have a feasible solution. □

Claim 5. For any valid pseudo-expectation in Step 3, with high probability we get a unit vector $c$ that satisfies $T(c,c,c) \ge 0.99$, and $c$ is far from all the previously found $a_i$'s.

Proof. By Lemma|3]we know there must be a vector o, such that E[(ai,x)^] > e~^^ for sufficiently 
small constant e. We show that this vector Oj cannot be among the previously found ones. By 
Lemma [m we know that for even number k, 

{{Si,x) {si - ai,x)f < - ai,x)^ + {si,x)^) 
Taking pseudo-expectations over both sides, we have that

E[{ai,x)’"] <2k -k kE[{si - ai,x)^]) <\\x\2=i^2k 

where we’ve used the constraint {si,x)‘^ < 1/8 and induction hypothesis ||si — aj|| < 0.1. 

Now applying Theorem 5.2 we get a vector $c$ that has inner product $1 - O(\epsilon)$ with $a_i$. Therefore

\[ T(c,c,c) = T(a_i,a_i,a_i) + T(c - a_i, a_i, a_i) + T(c, c - a_i, a_i) + T(c, c, c - a_i) \ge 1 - 1/\log n - 3\|T\|_{\mathrm{inj}} \|c - a_i\| \ge 0.99. \]

Here $T(x,y,z) = \sum_{i_1,i_2,i_3} T_{i_1,i_2,i_3}\, x_{i_1} y_{i_2} z_{i_3}$ is the multilinear form for the tensor, and note that this step of the proof does not need to be SoS because we already have the vector $c$ from Theorem 5.2. □
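The telescoping expansion of $T(c,c,c)$ used above holds for any 3rd order tensor by multilinearity. The Python sketch below checks it on a random tensor; the dimension n = 4, the seed, and the helper name `multilinear` are illustrative assumptions.

```python
import random

def multilinear(T, x, y, z):
    # T(x, y, z) = sum over i, j, k of T[i][j][k] * x[i] * y[j] * z[k]
    n = len(x)
    return sum(T[i][j][k] * x[i] * y[j] * z[k]
               for i in range(n) for j in range(n) for k in range(n))

rng = random.Random(2)
n = 4  # illustrative dimension
T = [[[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
     for _ in range(n)]
a = [rng.gauss(0.0, 1.0) for _ in range(n)]
c = [rng.gauss(0.0, 1.0) for _ in range(n)]
d = [ci - ai for ci, ai in zip(c, a)]  # the difference c - a

lhs = multilinear(T, c, c, c)
# Replace one argument at a time: (a,a,a) -> (c,a,a) -> (c,c,a) -> (c,c,c).
rhs = (multilinear(T, a, a, a) + multilinear(T, d, a, a)
       + multilinear(T, c, d, a) + multilinear(T, c, c, d))
telescoping_error = abs(lhs - rhs)
print(telescoping_error)
```

Each of the three correction terms has exactly one argument equal to $c - a$, which is why the proof can bound their total by $3\|T\|_{\mathrm{inj}}\|c - a_i\|$ for unit vectors.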

Claim 6. For any unit vector $c$ such that $T(c,c,c) \ge 0.99$, there must be a component $a_i$ such that $\|a_i - c\| \le 0.1$.

Proof. We define the following trivial pseudo-expectation $\tilde{\mathbb{E}}_c$ defined by $c$: $\tilde{\mathbb{E}}_c[p(x)] = p(c)$. Then we know that $\tilde{\mathbb{E}}_c$ satisfies the constraint $T(x,x,x) \ge 0.99$, and the degree of $\tilde{\mathbb{E}}_c$ can be any finite number. Therefore, by Lemma 9, we have that $\tilde{\mathbb{E}}_c[\langle a_i,x\rangle^{2k}] \ge e^{-(2\epsilon+\delta)k}$ for $k = O(\log n)$. Using the definition of $\tilde{\mathbb{E}}_c$, we have that $\tilde{\mathbb{E}}_c[\langle a_i,x\rangle^{2k}] = \langle a_i,c\rangle^{2k} \ge e^{-(2\epsilon+\delta)k}$. Taking $\epsilon = 0.001$, we have that $\langle a_i,c\rangle \ge 0.999 - \delta$, and it follows that $\|a_i - c\| \le 0.1$. □

These three claims finish the induction in the noiseless case. For the noisy case, we can handle it in the same way as in Theorem 4.1: note that $[\tilde{T} - T](x,x,x) \preceq_{\|x\|^2 = 1,\,12} 1/(2\log n)$, and this additional term does not change any part of the proof.

Finally, the runtime of Line 3 in Algorithm [2] is and the run-time of line 4 is also 

Therefore the total runtime is 

C Matrix Concentrations 

In this section we introduce theorems used to prove matrix concentrations. First we need the 
following lemma for decoupling the randomness in the sum. 

Theorem C.1 (Special case of Theorem 1 of [PMS95]). Let $X_1,\dots,X_n, Y_1,\dots,Y_n$ be independent random variables on a measurable space over $S$, where $X_i$ and $Y_i$ have the same distribution for $i = 1,\dots,n$. Let $f_{ij}(\cdot,\cdot)$ be a family of functions taking $S \times S$ to a Banach space $(B, \|\cdot\|)$. Then there exists an absolute constant $C$ such that for all $n \ge 2$, $t > 0$,

\[ \Pr\Big[\Big\|\sum_{i\ne j} f_{ij}(X_i, X_j)\Big\| > t\Big] \le C \Pr\Big[\Big\|\sum_{i\ne j} f_{ij}(X_i, Y_j)\Big\| > t/C\Big]. \]



We also need the Matrix Bernstein’s Inequality: 

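Theorem C.2 (Matrix Bernstein, [Tro12]). Consider a finite sequence $\{X_k\}$ of independent, random symmetric matrices with dimension $d$. Assume that each random matrix satisfies

\[ \mathbb{E}[X_k] = 0 \quad \text{and} \quad \|X_k\| \le R \ \text{almost surely.} \]

Then, for all $t \ge 0$,

\[ \Pr\Big[\Big\|\sum_k X_k\Big\| \ge t\Big] \le d \cdot \exp\Big(\frac{-t^2/2}{\sigma^2 + Rt/3}\Big), \quad \text{where } \sigma^2 := \Big\|\sum_k \mathbb{E}[X_k^2]\Big\|. \]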
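To get a feel for the tail bound, the following Python sketch evaluates the right-hand side of Theorem C.2 and inverts it for a target failure probability. The parameter choices mirror the quantities from the proof of Lemma 6 ($d = n^2$, $\sigma^2 = m^2/n^3$, $R = \sqrt{m}/n$) with all hidden constants set to 1, which is an assumption made purely for illustration; the function names are hypothetical.

```python
import math

def matrix_bernstein_tail(t, d, sigma_sq, R):
    # Right-hand side of Theorem C.2: d * exp(-(t^2 / 2) / (sigma^2 + R t / 3)).
    return d * math.exp(-(t * t / 2.0) / (sigma_sq + R * t / 3.0))

def deviation_for_failure_prob(delta, d, sigma_sq, R):
    # Solve t^2 / 2 = L * (sigma^2 + R t / 3) with L = ln(d / delta), i.e. take
    # the positive root of t^2 - (2 R L / 3) t - 2 L sigma^2 = 0.
    L = math.log(d / delta)
    b = R * L / 3.0
    return b + math.sqrt(b * b + 2.0 * L * sigma_sq)

n, m = 1000, 5000            # illustrative sizes with m below n^{3/2}
d = n * n                    # dimension of the matrices (a_i tensor a_j space)
sigma_sq = m ** 2 / n ** 3   # variance proxy, constants set to 1
R = math.sqrt(m) / n         # per-term norm bound, constants set to 1
delta = 1.0 / n              # target failure probability

t = deviation_for_failure_prob(delta, d, sigma_sq, R)
tail = matrix_bernstein_tail(t, d, sigma_sq, R)
print(t, tail)
```

The deviation `t` scales like $\sqrt{\sigma^2 \log(d/\delta)}$ when the variance term dominates, which is where the logarithmic factors hidden in $m \ll n^{3/2}$ come from.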
D Sum-of-Square Proofs


In this section we state some lemmas that can be proved by low-degree SoS proofs. Most of these lemmas can be found in [BS14] and [BKS14], but we still give the proofs here for completeness.

Lemma 10. [SoS proof for Cauchy-Schwarz] The Cauchy-Schwarz inequality can be proved by a degree-2 sum of squares proof,

\[ \Big(\sum_i a_i^2\Big)\Big(\sum_i b_i^2\Big) - \Big(\sum_i a_i b_i\Big)^2 = \sum_{i<j} (a_i b_j - a_j b_i)^2. \]

Lemma 11. For any vectors $x, y$ and any even number $k$,

\[ \|x + y\|_k^k \preceq_k 2^{k-1}\big(\|x\|_k^k + \|y\|_k^k\big). \]

Proof. Note that it suffices to prove it for one-dimensional vectors $x, y$. We prove it by induction. For $k = 2$, it just follows from Cauchy-Schwarz. Suppose it is true for the $k - 2$ case; then we have

\[ (x + y)^k = (x + y)^{k-2}(x + y)^2 \preceq 2^{k-3}(x^{k-2} + y^{k-2}) \cdot 2(x^2 + y^2). \]

Note that

\[ 2(x^k + y^k) - (x^{k-2} + y^{k-2})(x^2 + y^2) = (x^2 - y^2)^2 (x^{k-4} + x^{k-6}y^2 + \dots + y^{k-4}) \succeq 0. \]

Combining the two equations above we obtain the desired result. □
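Both the scalar inequality of Lemma 11 and the factorization used in its induction step can be spot-checked numerically. The Python sketch below does so for a few even values of $k$ and random scalars in $[-1, 1]$; the sample range, seed, and function names are illustrative assumptions.

```python
import random

def lemma11_gap(x, y, k):
    # 2^{k-1} (x^k + y^k) - (x + y)^k, which Lemma 11 says is nonnegative.
    return 2 ** (k - 1) * (x ** k + y ** k) - (x + y) ** k

def induction_identity_sides(x, y, k):
    # Both sides of the factorization in the induction step (valid for k >= 4).
    lhs = (2 * (x ** k + y ** k)
           - (x ** (k - 2) + y ** (k - 2)) * (x ** 2 + y ** 2))
    tail = sum(x ** (k - 4 - 2 * j) * y ** (2 * j)
               for j in range((k - 4) // 2 + 1))
    rhs = (x ** 2 - y ** 2) ** 2 * tail
    return lhs, rhs

rng = random.Random(7)
min_gap = float("inf")
max_identity_err = 0.0
for k in (2, 4, 6, 8, 10):
    for _ in range(200):
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        min_gap = min(min_gap, lemma11_gap(x, y, k))
        if k >= 4:
            lhs, rhs = induction_identity_sides(x, y, k)
            max_identity_err = max(max_identity_err, abs(lhs - rhs))
print(min_gap, max_identity_err)
```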


Lemma 12. Suppose $M$ is an $m \times n$ matrix with spectral norm $\|M\|$, then

\[ (x^\top M y)^2 \preceq_4 \|x\|^2 \|y\|^2 \|M\|^2. \]


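Proof. Assume $m \le n$ without loss of generality, and suppose $M$ has singular value decomposition $M = U \Sigma V^\top$ where $\Sigma = \mathrm{diag}(\sigma_1,\dots,\sigma_m)$. Let $z = U^\top x$ and $w = V^\top y$. Then

\[ (x^\top M y)^2 = \Big(\sum_{i=1}^m \sigma_i z_i w_i\Big)^2 \preceq_4 \Big(\sum_{i=1}^m \sigma_i^2 z_i^2\Big)\Big(\sum_{i} w_i^2\Big) \preceq \|M\|^2 \|z\|^2 \|w\|^2 = \|M\|^2 \|x\|^2 \|y\|^2. \]

□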

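Lemma 13. For a nonnegative real number $a$, a set of polynomials $R$ and a positive integer $k$, if a polynomial $p(x)$ satisfies $p(x)^2 \preceq_{R,k} a^2$, then $p(x) \preceq_{R,k'} a$ for $k' = \max\{k, 2\deg(p)\}$.

Proof. By a simple manipulation of algebra (for $a > 0$), we have that

\[ p(x) - a = \frac{1}{2a}\big(p(x)^2 - a^2\big) - \frac{1}{2a}\big(p(x) - a\big)^2 \preceq_{R,k} -\frac{1}{2a}\big(p(x) - a\big)^2 \preceq_{R,k'} 0. \]

□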
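The algebraic identity in the proof of Lemma 13 can be checked by treating $p(x)$ as a plain real number. A minimal Python sketch, with the sampling ranges and seed chosen only for illustration:

```python
import random

rng = random.Random(5)
max_err = 0.0
for _ in range(1000):
    a = rng.uniform(0.1, 5.0)   # a > 0, so that dividing by 2a is valid
    p = rng.uniform(-5.0, 5.0)  # stands in for the value of p(x)
    lhs = p - a
    rhs = (p * p - a * a) / (2 * a) - (p - a) ** 2 / (2 * a)
    max_err = max(max_err, abs(lhs - rhs))
print(max_err)
```

The first term on the right is nonpositive whenever $p^2 \le a^2$ and the second is minus a square, which is exactly how the SoS proof concludes $p \le a$.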